Simulation-based Regularized Logistic Regression 
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Abstract 

In this paper, we develop a simulation-based framework for regularized logistic re- 
gression, exploiting two novel results for scale mixtures of normals. By carefully choos- 
ing a hierarchical model for the likelihood by one type of mixture, and implementing 
regularization with another, we obtain new MCMC schemes with varying efficiency 
depending on the data type (binary v. binomial, say) and the desired estimator (max- 
imum likelihood, maximum a posteriori, posterior mean). Advantages of our omnibus 
approach include flexibility, computational efficiency, applicability inp >n settings, 
uncertainty estimates, variable selection, and assessing the optimal degree of regular- 
ization. We compare our methodology to modern alternatives on both synthetic and 
real data. An R package called reglogit is available on CRAN. 

Key 'words: Logistic Regression, Regularization, z-distributions, Data Augmenta- 
tion, Classification, Gibbs Sampling, Lasso, Variance-Mean mixtures, Bayesian Shrink- 
age. 



1 Introduction 



X 
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Large scale logistic regression has numerous modern day applications from text classification 
to genetics. We develop a flexible framework for maximum likelihood, maximum a posteriori, 
and full Bayesian posterior inference for regularized models. Our motivations stem from a 
desire to find common ground between point estimation in "large-p" settings ( Krishnapuram 



et al. , 2005 Genkin et al. , 2007), where p is the number of predictors, and full Bayesian 



2006 



Friihwirth-Schnatter and Fruhwirth, 



2007 



inference for "small-p" (Holmes and Held 

Friihwirth-Schnatter et al.l 2009; Fahrmeir et al.l 120101 IFruhwirth-Schnatter and Fruhwirth 



1} £ 



2010). Collecting such distinct methods into a unifying framework facilitates a number o 



novel enhancements including posterior inference for the amount of regularization, and an 
efficient handling of binomial data. 



*Part of this work was done while RBG was at the Statistical Laboratory, University of Cambridge 



We start by framing a typical regularized optimization criteria as a powered-up poste- 



rior, or power-posterior (Friel and Pettitt , 2008), with a shrinkage prior such as the lasso 



(Tibshirani, 1996). We then show how inference may proceed by employing two (heretofore 



unrelated) data augmentation schemes: one for the powered-up logistic likelihood; and the 
other for the prior. The combined effect is a fully Gibbs MCMC sampler which, among other 
advantages, allows estimators previously requiring custom algorithms to be calculated via a 



single simulated annealing (Kirkpatrick et al 

Specifically, consider a set of binary responses, y^ 
dimensional predictors X{ via the logistic model P(y$ = 



1983) scheme. 

encoded as 
±l\xi,/3) = 



±1, regressed on p- 
(1 + e -2/i*7/3)-i for 



n. 



When p is large it is paramount to infer /3 under regularization or penalization. 



A common formulation (e.g., Genkin et al. , 2007 Park and Hastie, 2008) involves finding 



regularized point-estimates under an L a -norm penalty, where parameters a 2 
control the relative penalization applied to each predictor 



cr 



i- 



cr: 



(5 = argmin^ 



n p 



P., 



(T; 



a > 0. 



The parameter v dictates the amount of regularization, the relative pull [y x ) or shrinkage 
of the /3j-'s towards zero. Depending on the choice of a, a number o f algorithms have been 



proposed to solve for j3. For example, Madigan and Ridgeway (2004) discuss how the LARS 
algorithm can be useful as a subroutine for the popular case of a = 1. It is typical to work 
with Xj pre-scaled to have unit L2-norm with o~\ — ■ ■ ■ cr p — 1 so that inference for is 
equivariant under a re-scaling of the covariates. We follow this convention in application 
but develop much of the discussion in the general case for completeness. The special setting 



o~j = oo indicates no shrinkage for f3j. 



At least max{0,p — n} of the cxj's must be finite 
to obtain stable estimators. If there is an intercept in the model, denoted by /3o, then it 
is common practice to absolve it of penalty by taking a^ = oo. Throughout we begin the 
j-indexing at j — 1, ignoring the th term for simplicity. 

Our approach offers a fully probabilistic alternative by viewing the objective function 
(fTl) as a (log) posterior distribution whose maximum a posteriori (MAP) estimator coincides 
with 0. A multiplicity parameter k can then be introduced to help find the MAP via simu- 
lation. Our key insight, which makes the simulation efficient, is that the logistic likelihood 



component of the posterior can be written hierarchically using ^-distributions (Barndorff- 



Nielsen et al. , 1982), leading to a data augmentation scheme that generalizes that of Holmes 



and Held (2006) [HH hereafter]. Combining this with a standard data augmentation for the 
prior yields a highly blocked Gibbs MCMC algorithm for logistic regression. Z-distributions 
also suggest a new representation of the likelihood that is equivalent (to HH) but requires 
n fewer latent variables. Finally, we recognize that k has a secondary use for binomial data 
(multiple y observed for each x) which otherwise would require more latent variables. 

A distinctive feature of our framework is how it deals with the amount of regularization, z/, 
which is traditionally chosen by cross validation (CV). As an alternative, we may extend the 
hierarchical model to include a prior for v so that the marginal likelihood can be computed 
and used to set v = z>, or to integrate v out. Posterior expectations, thus obtained, can give 



superior point-estimators for (3 in large-p linear regression contexts (Hans, 2009), and we 
show how this extends to logistic regression. 

The rest of the paper is outlined as follows. Section [2] provides our data augmentation 
strategies for sparse high dimensional logistic regression, and Section [3] develops an MCMC 
scheme for estimation. Section [4] illustrates our approach with empirical comparisons to 
modern competitors. Finally, Section [5] concludes with simple extensions and directions for 
future research. An supporting R package, reglogit, is available on CRAN. 



2 Regularized logistic regression via power-posteriors 

The central problem is to find the MLE, MAP, or posterior mean estimator in logistic 
regression. To do this, consider the following power-posterior distribution inspired by Eq. (jTl): 



TTk 



\V,V; 



° 2 ) = C K>a {u) exp J -k I JT In (l + e-**?^ + 






Pi 

(7; 



(2) 



The placement of k and a as subscripts in 7r KiCf and C K)(X (u), a normalization factor, signals 
that these are user specified, not parameters to be estimated. The a setting indicates the 
type of L a regularization, e.g., L\ for absolute, and L 2 for quadratic. The multiplicity (or 
thermodynamic) parameter k, is a tool borrowed from the power-posterior and simulated 



annealing literature (see, e.g., Pincus 1968 Kirkpatrick et al. 1983 Doucet et al. , 2002 



Jacquier et al. , 2007 Friel and Pettitt, 2008), that facilitates several types of simulation 



based inference, as we shall describe. 

Power-posterior analysis can be helpful for calculating modes and posterior means from 
complex optimization criteria, and marginal likelihoods for Bayesian estimators. Larger 
values of k cause the density to concentrate near the modes, whereas small k distributes 
it away from the modes, in the troughs. This motivates two types of estimator. First, 
E KiQ {/3|y, a 2 , u} can be estimated for choices of v by allowing k to vary as in simulated 
annealing. When v — the estimator converges to the MLE as k — > oo. When v > 0, it 
converges to a posterior mode, or equivalently the regularized estimator, J3 solving Eq. (II). 
Furthermore, setting k — 1 yields the posterior mean estimator. Second, we recognize that 
K can be used to obtain an efficient computational framework for binomial regression, where 
multiple binary responses are recorded for each predictor. In what immediately follows, we 
regard k as fixed — a further discussion is deferred to Section [3j 

Observe that the likelihood-prior combination below yields Eq. (J2l) via Bayes' rule. 
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The following subsections provide data augmentation schemes for this likelihood and prior. 
They primarily concentrate on the a = 1 case, i.e., the double-exponential or lasso prior, 
although results are developed in generality when possible. Section [5] briefly touches on the 
simpler a = 2 case. 

2.1 Hierarchical representation of the logistic 



Extending a well-known technique for generating logistic regression (e.g., Andrews and Mal- 
lows, 1974 Holmes and Held 2006), we represent the powered-up likelihood (|3j) for f3 



as a marginal quantity obtained after integrating over latent variables (z,X), where z 
(zx,...,z n ) and A = (Ai,...,A n ). That is, each element of the product of independent 
logistic terms can be written as a two-dimensional integral: 

n t*co /*oo 

L n(y\(3) = T\ / PK(zi\P,*i,yi)PK(\)dXidzi. (4) 

i=1 Jo Jo 

This suggests a hierarchical representation in terms of latent variables, Zj for each i/i, mixed 
over Aj. It remains to determine the appropriate form of p K (zi\j3, K,Vi) and p K (\i) so that 
(l + e-y^^- K = JJp K {z i \/3,X i ,y i )p K {X i )dX i dz i ^\ 

Our key result, generalizing HH, relies on a scale mixture representation of ^-distributions 



(Barndorff- Nielsen et al. , 1982). These are characterized by their pdf as: 



]_ e a(z-/i)/cr 

Z(z:a,b,a,u) = f z (z\a,b,a,a) = —— — — — -. — ,, . , , (5) 

where q a ,b(X) is a Polya distribution, i.e., an infinite sum of exponentials: 

oo 

q a ,bW = Ys Wke ~^ kX where ^k = {a + k){b + k), (6) 

k=0 

and the weights w^ are determined via 5 = (a + b)/2 and 9 = (a — b)/2 as 

/_2A (S + k) = (-l) k (25)...(25 + k-l) (5 + k) 
Wk \ k jB(5 + e,S-9) k\ B(5 + 9,5-9)' [) 

This prior has a simple generative form: 

oo 

A = ^2^ 1 e fc , where e k ~ Exp(l). (8) 



k=0 



Then, each component (l+e yx ^) K of the likelihood (dropping i subscripts) can be written as 
the cumulative distribution function (cdf ) evaluation (at zero) of a particular ^-distribution. 



1 The notation reserves tt(-) for the marginal posterior j3 as a visual queue for the quantity of primary 
interest. All other probability densities use p(-), including the joint for latent (z, A) and all priors. 



Theorem 1. The (powered up) logistic function may be represented as follows. 

z-yx T /3- -(1-k)A J \ q hK (X) dXdz. (9) 



roof. If z ~ Z(l, k, 1, yx T /3), then F z (z) = 1 - (1 + e z ~ yx ^)~ K 



Proof. If z ~ Z(l, «, 1, yx T /3), then F z (z) = 1 - (1 + e 2 "^ ^)" K , giving 1 - F z (0) = (1 + 
e~ yx /3 ) _K . In other words, 

h +e -^ M = / Z(z;l,K,l,yx T p)dz, (10) 

establishing the outer integration, over z, in Eq. Q. Applying the representation in Eq. ^ 
yields the desired result. □ 

The statistical implication of this is a hierarchical model which we summarize in the 
following corollary. 

Corollary 1. The conditional distribution p K (zi\(3, Aj, yi) and the mixing distribution qx )K (\i) 
imply that the latent Zi follow 

p K {zi\/3, x h Vi ) = at + Ltxjp + l -{i - K)x h xA , (ii) 

where A/" + is the normal distribution truncated to the positive real line. 

In more compact notation, z\(3, \,y ~ Af^((y.X)/3+^(l— k)X, A), where y = (yi, . . . ,y n ) J , 
y.X = diag(y)X, A = diag(Ai, . . . , A n ), and the truncation is to the all-positive orthant. Ob- 
serve that, when k — 1, the above formulation is identical to the generative model described 
by HH. Given predictors Xi and regression coefficients (3, generate y, G { — 1,-1-1} as 

00 2 

y % = sign(^), where z { ~ Af(xJ/3,Xi) and \ { = Y^ e k , t k ~ Exp(l). (12) 

When k > 1, the asymmetry of the z-distribution makes it harder to extract yi from yixj (3 + 
o(l — n)Xi, the mean of the truncated normal in Eq. (11). In Section 3.3, we indirectly suggest 



that one can interpret Kyi as a binomial response when k is an integer. 

An alternative z— representation: 

Theorem [I] shows how components of the powered-up logistic likelihood can be represented 
hierarchically by the cdf of ^-distributions. We therefore call that multiplicity extension to 
HH the cdf representation. However, further inspection reveals that it is possible to eliminate 
an integral in Eq. (El) and thus n latent variables, and use the representation 

(1 + e*'-"*)-* = Z(zi] a = 0, b = k, 1, &) 



which avoids integrating over Zj. Instead, set them to zero (and /ij = y^xj (3) and directly 
obtain (l + e ViXi /3 )~ K . By analogy, we call this a pdf representation as it involves evaluating a 
particular z-density function. This simple representation is problematic, however, since the 
Polya mixing density qo yK is improper. In particular, note that ipo = 0, resulting in a infinite 
weight in the generative formulation (J8J). 

Fortunately, a similar representation may be generated 



vl + e - 



il\-K 



Z(z; a, b, l,/i) 
1 



afi 



V2^X 



2=0 



cxp 



1 

2A 



-fi--(a-b)X) }q a , b (X)d\ (13) 



whi ch involves a proper Polya mixing density as long as (a, h) > and a + b = k. In Section 
|3.l| & |4j we show how the extra e aM = e ayiXi & poses no problem for efficient inference, and 
that (a = |,6 = K — |) works well in practice. But first, we complete the power-posterior 
specification with a family of regularization priors on (3. 



2.2 Prior regularization 

Regularization is achieved via a family of priors, p Ka (j3\v, a 2 ), implementing the L Q -norm via 



the decomposition p Kya ((3j \ u, a 2 ) = J p K ,a{j3j\ojj, u, a 2 )p a {ujj) du)j, following Carlin and Poison 
(1991) and Park and Casella (2008) in regularized (Bayesian) linear regression context. The 



idea is that, given f3j 



,i/ Q °i 



Ujtj 



iid 



and €j ~ A/"(0, 1), small v (i.e., heavy regularization) 



and large k (i.e., heavy concentration of power-posterior density around the mode at the 
origin) both shrink /3j towards zero. We provide p a (oo) yielding the desired regularization 
penalty which, after unpacking factors from C K}OL (v) in Eq. d3J), is 



Pk,< 



W,o- 2 ) = Y[p K , a Wj\ u itf) K v pK exp -k^2 



Pi 



VGa 



(14) 



Box and Tiao (1973) provide a general discussion of (14) in the linear regression context 



Some notable special cases in the recent literature on sparse logistic regression include the 
following: when v — 1, a — 1, and Oj = A, it is the Laplace prior used in IGenkin et al. 



(2007); when a 
and v~ l 



2, a,-, 



1 and v = a 



A it is the Laplace prior from Krishnapuram et al. 



it is the Gaussian prior, and when a = 2, <Tj 



1 
Inference for v in 

these cases typically proceeds by CV, or by inspecting the paths of (3 U solutions for varying 
v. Assessing the uncertainty in estimators (3$ on the final choice of v can pose difficulties. 

Power posterior analysis offers an intriguing, third, option by providing the potential for 
tractable marginalization over prior uncertainty v ~ p K ,iAy\ Two particular choices in the 
a = 1 case lead to efficient inference by Gibbs sampling [Section 3.1 . One option is an 



2 The Xj and A variables correspond to the shrinkage parameters so named in our references. They should 
not be confused with the latent A^ used in our hierarchical likelihood representation. 



inverse gamma (IG) prior for v 2 with shape r K = k{t + 1) — 1 and scale d K = nd, where 
K — 1 yields a base case IG(^ 2 ; r,d) prior. The second option is IG for u, with identical 
powering-up identities. It has lighter tails in z/ -1 , thus providing more aggressive shrinkage. 
The prior in Eq. (14) — for the purposes of efficient inference [Section [3] — is an adaptation 
of a scale mixture of normals result from West (1987) to account for k. Specifically, 



PwiPjWitf 



AfiPi-Au* 



,V| 



K 



2/a 



p a (uj) du)j, 



(15) 



o 

where p a {u>j) oc u- 2 St|(a;^~ ) and St+, 2 is the density function of a positive stable random 

variable of index a/2. In compact notation, (3\a 2 ,ui, u, k ~ J\f p (0, i/ 2 //£ 2//a £f2) where S = 

p). An important corollary, obtained by adapting an 

(1974) result, is that if a = 1, Uj ~ Exp (2), and Uj — 1 for j = 1, . . . ,p 



diag((7 2 , . . . , a 2 ) and Q = diag(o;i 



,w r , 



Andrews and Mallows 



then p K (/3\v) is double exponential (Laplace) with a mean zero and scale z/ 2 /k 2 . 



3 Simulation-based logistic regression 



We develop a Gibbs sampling algorithm [Section 3.2| for sampling the augmented power- 
posterior p K (0, z,lj, A, u\y, a 2 ), for any k. We first derive the relevant posterior conditionals 
[Section 3.1 , treating cdf and pdf representations in turn. When k — 1 the marginal samples 
of /3 summarize the posterior distribution of the main parameters of interest. Obtaining the 
MAP or MLE requires an inhomogeneous Markov chain [Section 3.2] . Finally, we describe 



how a vectorized k can facilitate efficient Bayesian binomial regression [Section 3.3 



3.1 Posterior conditionals 

To begin, consider the latent z and A variables in the cdf and pdf representations, in turn, 
followed by the coefficients /3 and corresponding regularization prior parameters (u>,v). 



Latent likelihood parameters (z, A) 



By construction [Eq. (11) of Corollary 111, the posterior full conditional for the latents, 
p K (zi\{3,\i,yi), is a truncated (non-negative) normal distribution. Obtaining samples, in- 
dependently for i = 1, ... ,n, is straightforward following the methods of Robert (1995). 



Sampling from the full conditional p K (\i\/3, Z{, yi) is complicated by the infinite sum in the 
expression for the prior (J6l), which precludes a naive approach via truncation since certain 
combinations of A« and b = k can give highly inaccurate, even negative, evaluations. HH 
derive an expression for this conditional when k = 1 and provide a rejection sampling 
algorithm by squeezing ( |Devroye 1986). Although adaptable for general k, we prefer a 
Rao-Blackwellized approach. Interchanging the order of integration in Eq. (|9| suggests a 
corollary to Theorem [I] that is helpful in constructing a Metropolis-Hastings (MH) scheme 
for obtaining Aj draws. 



Corollary 2. The following is an alternate integral representation of the logistic function 



cxp 



{-«ln( 



l + e 



$ 



qi, K (Xi)d\i 



'o v y\ 

where $ is the cdf of the standard normal distribution. 
Proposals \' t ~ 9i, re (A) can then be accepted via MH with probability min{l, A{\ where 



A; 



<j>{{-y iX lP-\{\-K)K)ly/K} 

^{-yixJp-Ul-K^/sfXi} 



(16) 



Good proposals may be obtained by truncating the sum in Eq. ([8]) at K — 100 for k = b = 1 



with improvements for larger k. Direct sampling is also possible (e.g., Weron, 1996). 

Empirically, the MH acceptance rate is high (> 90%) for k = 1 because posterior is similar 
to the prior (#1,1). Therefore the MH scheme may be preferable to the rejection/squeezing 
method of HH who report acceptance rates as low as 25%. Both rates decline as k is 
increased, but the MH rate is still above 1% for k = 20. A good rule of thumb is to thin \k] 
draws for each draw saved, which is reasonable from a computational standpoint as sampling 
from q a fi is fast. Even when thinning more than 10-fold, the MH sampler is competitive to 
HH/Devroye in terms of sheer speed. The MH requires two $ evaluations, a few arithmetic 
operations, and two square roots. HH/Devroye, by contrast, can perform dozens (or more) 
expensive operations such as pow before the squeeze is made. Finally, drawing A, unconditional 
on Zi yields lower autocorrelation in the overall joint MCMC sampling scheme. 

The pdf representation is simpler since Z{ is set to zero. Proposed A^ ~ q a ^ may be 



accepted or rejected via MH by exchanging a cdf for a pdf in Eq. (16) and replacing |(1 



with |(o — b). Another feature that works well for the pdf representation is an adaptation 
of the slice sampler of Godsill (2000). Given Aj, the next sample A^ may be obtained via 
an auxiliary uniform random variable as follows. Let <pi = <f){(—yixj (3 + |(o — 6)Aj)/a/A7}, 
where <fi is the pdf of a standard normal distribution. Then sample 



u\Xi,Xi,yi,j3 ~ U[0,(pi} 



followed by 



A-K xt, yi, (3 ~ g aife (A-)I { ^ >u} , (17) 



where the second step is facilitated by accept /rejects following random draws from the Polya 
mixing density. Although more automatic in that it does not require thinning, we show in 



Section ^2 that the MH scheme is faster overall. The two methods behave similarly when k 
gets large, causing the rate of rejections/required thinning to increase. 



Regularized regression coefficient parameters (/3, u, v) 

In 



2.2 



;he cdf representation, the multivariate normal priors for z [Section 2.1 and [3 [Section 
combine to give j3\z, uj, A, u, n ~ Np($, V) with hyperparameters 



T a-1 



P = V(y.X) ' A 



1-«)AJ, and V- 1 = (u/K 1/a )- 2 ^- 1 Q- 1 + (y.X) T A-\y.X). 



Obtaining V from V 1 is generally 0(p 3 ), which represents a significant computational bur- 
den in the p>n context. By employing the Sherman-Morrison- Woodbury formula (e.g., 



Bernstein, 2005, pp. 67), it is possible to use an 0(n ) operation instead, which could rep- 



resent significant savings. In the pdf representation a similar combination of regularization 
penalties and likelihoods gives an identical V~ x expression, but a new (3 = (a— \[a — b])VX T y 
[see Appendix A . Choosing (a = |, b — K — |) gives (3 = ^VX T y, a particularly simple ex- 
pression that may be used for k > |. It is interesting to observe that the parameters (A, u, v) 
only enter into the conditional for /3 through V in the pdf representation. 

The full conditional distribution of each latent ujj is proportional to the integrand of 



Eq. (15). When a = 1 we have the following adaptation of a standard result. 



Corollary 3. For a = 1, the full conditional distribution of the reciprocal of uj- 1 follows an 
inverse Gaussian distribution: ujJ 1 ]^,^, k ~ IN(^| ^-j" 1 , 1). 



Proof. From the integrand in Eq. ( 15 ) with a = 1 we have 



P^UjlPj,!/) OC 



y/2nuj ' I - 



u 2 a]uj 



+ Ui 



( 1 * 2 /3? 
GIG w,;r,l 






? 1 n ' ? 9 9 

2 v 1 oi 



which is implies that u; • x ~ IN ( ^ ^ , 1 j . [See Appendix 



B 



for IN/GIG definitions] 



□ 



Our IG priors for v are both conditionally conjugate. An IG prior for v 2 and the rep re- 



sentation in Eq. (15) gives 



v 2 \(3, u, k ~ IG ( r K + — , 4 + — , 

1 l j= i a j u i . 



E 



An IG prior for v leads to efficiency gains (in addition to better tail properties) since there 



is no conditioning on ui. Using Eq. (14) directly in then gives 

( P 

u\/3, K ~ IG \r K + up, d K + k y^ 

V 3=1 



A 



O"; 



extending the analysis of Park and Casella (2008). 



3.2 Gibbs sampling and annealing for point estimators 

A full Gibbs sampling algorithm for both cdf and pdf representations is outlined in Figure 
nl For compactness, variations with slice sampling for A [in the pdf case] or a prior on v 2 are 
not shown. The former requires replacing each iteration of step 2 by the method surrounding 

Eq. (17). The latter requires drawing v 2( - s > 



,2(») 



IG [r K + f , 4 + \ E?=i ^w in step 5 and 



specification of u 2 ^ on input. Initial latent u>j values are not required. 



Inputs: 

• Data: n x p response-multiplied design matrix y.X 



Settings: multiplicity k > 0; scale factors a\, . . . ,a p where E = diag(cr^, . . . , az); repre- 
sentation type R G {cdf , pdf}; Polya parameters (a, 6) where (a = 1, 6 = k) if i? = cdf or 
(a, 6) > and a + b = k, otherwise; prior parameters (r K , d K ) > 0; sample size S 

Initial values: p<® = (/3j 0) , . . . ,/3£ 0) ) T , i/°), latents A<°) = diag(A[ 0) , . . . , xi? ] ), and if 
R = cdf also include latents z^ ' = (z[ , . . . , Zn ) T 



Gibbs sampling, for iterations s = 1, . . . , S: 



1. For ; = 1, . . . ,p take a;" 1 ~ IN M^= 



(«-i) 



,(«-!) 



-1 



1 , and let ^ = diag(wj s \ 



X»h 



. . , wp 



2. For i = 1, . . . , n do the following depending on the representation i? 

• propose A'j ~ g a , b approximately via (8) as A- = ^fe=i ( a +k)(b+k) > wnere e fe ~ Ex p(!) 
and if large 

• draw u ~ Unif (0, 1) and if u < Ai where 



A, 



*{(-j/i^ ( a( s - 1 )-|(i-«)A| s - 1) )/ v / Ap iy 



if R = cdf 

otherwise 



0{(-^ /3(s -i)_i (a _ b)A (-i) )/ y A pT)j 

then take X\ s = X' i: or take \\ s = X\ s otherwise. 
Then let A^ = diag(AJ s) , . . . , X { p s) ) and \W = (x[ s) , . . . , X p s) ) T 

3. If R = cdf then for i = 1, . . . ,n draw z\ s) ~ M + (yixjp^- 1 ^ + 1(1 - k)aJ s) , Aj s) ) , and 
collect them as z^) = (z± , . . . , Zn ) 

4. • Calculate V -1 W = (lA'ty/c^E - ^- 1 ^) + (y.X) T A- 1 ( s )(y.X). If # = cdf then 

calculate /3 = VW(y..X') T A- 1 W (z( s ) - 1(1 - k)X^), otherwise 0W = (o - |[a - 

• Draw/3W~AA p (/3( s ),W s )) 



5. Draw i/W ~ IG ( r K + Kp, d K + k Y%=i 



o(s) 



Output: {/?( s )}f =1 , {z»}f =1 , latents {A( s )}f =1 , and if R = cdf also include latents {^( s )}f =1 
Figure 1: Pseudocode for simulation based regularized logistic regression. 
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The samples on output may be used to approximate expectations under the power- 
posterior distribution with multiplicity k. If K — 1 then these are samples from a well- 
defined posterior distribution which may be used, e.g., to approximate the posterior mean 
of /3 or provide samples from the posterior predictive distribution. Both take into account 
the full the uncertainties of all parameters (including v) into account — a feature unique to 
full Bayesian analysis. 

Settings of k > are useful for finding other popular estimators via simulated annealing 
(SA). In our context, SA establishes an inhomogeneous Markov chain over a sequence of 
power-posteriors, starting with k = 1 and then increasing according to a pre-determined 
schedule. Except when Gibbs sampling is possible for all k, (as for our power-posterior), it is 
usually difficult to ensure that the Markov chain mixes well, particularly when k increases. 
A pragmatic approach starts at k ~ 1, and systematically makes modest increases in k until 
Monte Carlo variation in the power-posterior expectations of the quantities of interest is 
below a pre-determined threshold. Each annealing iteration is initialized with the last value 
/?0% i*( s \ \( s ) and z^ s \ from the previous iteration, thereby stitching the inhomogeneous 
Markov chains together. The chain for each k must have enough iterations to establish 
convergence to its particular power-posterior. 

Annealed procedures such as ours present an MCMC alternative to EM-style algorithms. 
Importantly, SA is known to converge to the global optima in certain conditions (when 
k, — y oo), whereas EM is only guaranteed to find a local optima. Although convergence 
for EM is usually quick, there are no guarantees that it will be so and indeed there are 
examples, particularly in high dimensional settings, where convergence can be arbitrarily 
slow. SA however, comes with the burden of choosing the schedule for increasing k. We 
have found that for our regularized logistic regression scheme, convergence is fast and mixing 
so good that short schedules such as k = 1,5, 10,20 are a safe default [see Section El. Even 
jumping immediately to modest n (m 20), skipping k — 1, can very often yield cheap and 
accurate approximations. 

But perhaps the most noteworthy difference between our simulation approach and previ- 
ous methods (like EM) are the myriad of options (beyond CV) for inferring v. One option, 
in the classical context, is to use annealing to find the joint mode of (/3, v). Another option 
is to first use samples from the posterior marginal p(u\X, y) to estimate the posterior mean 
v = E{z/|X, y}, and then proceed to estimate /3 = E K {/3|i>, X, y} as before. In Figure [lj this 
would be facilitated by inputting i/(°) = v and replacing step 5 with v^ 8 > = v^ l \ Our expe- 
rience is that the former works well for small p problems, and the latter for large p. When 
p is large, the joint prior for (/3, v) dominates near the posterior mode of v, which tends to 
zero and yields /3 = 0, which is not helpful. The marginal posterior mean is far less sensitive 
to the regularization prior, and represents a more convenient choice for large p applications. 
In Section |4~4 we provide an example where the joint mode is easy to find with a few dozen 



predictors, whereas an interaction expanded version using thousands requires more care. 
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3.3 Efficient handling of binomial data 

Another advantage of our approach is the extension to binomial data, where binary responses 
are collected repeatedly and independently, rii times for subjects with the same covariates X{. 
Contingency tables are one important example. A typical (un-regularized) logistic regression 
model is y^Xi ~ Bin(n,, /ij), where fa = e r,i /(l + e' ni ) and rfo is linear in X{. One way to situate 
such data within this article's regularized logistic regression framework is to flatten it, so 
that rii components appear in the likelihood for each subject i: n?=i(l + e~ VijXi /3 ) K , using 
the binary encoding yij G { — 1, 1} giving | YTj=i Vij\ = n i- This allows inference to proceed as 
described in Section [3j but it can lead to an inefficient MCMC scheme if the rii are large due 
to the rii latents required for each i. It turns out that it is possible to use only two latents 



for each i, echoing a feature of methods described by Friihwirth-Schnatter et al. (2009). 

Observe that the component of the likelihood for subject i may be equivalently written 
with just two terms as (1 + e~ Xi /3 ) K ^(1 + e Xi PY^ ni ~ yi \ which is proportional to the i th 
component of a typical binomial likelihood with logit link. Hence the full likelihood, with 
m unique subjects, can be written as nili(l + e ~ x% l3 ) Kl+ (l + e Xi /3 ) Ki -, where n i+ = ny^ and 
«i_ = n{rii — yi). This is identical to a z-distribution representation of the logistic likelihood 
with 2m terms, which may be much less than the YlT=i n i produced by flattening. The first 
m terms use response "data" y\ = +1 with multiplicity parameter m+, and the second m 
terms use y\ = — 1 with Ki-. A multiplicity implementation is therefore facilitated by forming 
vectors y' and k' , each of length n = 2m, and using rj" =1 (l + e~ ViXi /3 ) K *. 

The MCMC scheme proceeds as in Section [3]by vectorization. For example, steps 2 and 3 
for Zi and \ would use k! { instead of k. For (3 in step 4 with (a = 0.5, b = k — 0.5) say, replace 
k1„ with the k' vector in the expression for /3. Terms can be eliminated from the likelihood, 
thus eliminating the corresponding latents, where k[ = 0, as is the case when y, e {0,rij}. 
The original, scalar, k is used for the conditionals corresponding to the parameters of the 
prior. For example, the posterior conditional covariance V of /3 is unchanged. 

4 Applications 
4.1 Pima Indian data 



The Pima Indian diabetes data [UCI Machine Learning Repository (Asuncion and Newman 



2007)] includes outcomes for diabetes tests performed on n = 768 women of Pima heritage 
with 8 real- valued predictors. Some of the predictors have many zeros, which may reasonably 
be interpreted as "missing" values. To remain consistent with the treatment of this data 
by HH, and other authors, we do not treat these values in any special way. The following 
analysis highlights properties of regularized estimators of (3 = (f3o = /x, f3\, . . . ,/3s) obtained 
with a = 1, o~j = 1 for j = 1, . . . , 8, and T = 1000 samples from the resulting posterior (the 
first 100 as burn-in). 

Figure summarizes the marginal power posterior(s) for (5 with boxplots. Three settings 
of k G {1, 5, 20} (each panel) were used, and heavy regularization (fixing v = 6) was applied. 
Only the first panel (k = 1) summarizes samples from the true posterior. The k > 1 settings 
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Figure 2: Power-posteriors for the Pima Indian data: v = 6, k G {1, 5, 20}. 



are useful for obtaining other estimators. The MLE, obtained from the glm command in 



R (R Development Core Team, 2009), and the MAP as estimated from the sample(s), are 
also shown. Shrinkage is apparent in the divergence between the MAP and MLE values in 
all panels. Observe how the quartiles and outliers converge on the MAP as k is increased, 
reflecting higher confidence in the accurate estimation of those values. Convergence is par- 
ticularly rapid for the intercept term, and the two coefficients with considerable mass near 
zero (/?4 and /3s). These columns of X have the highest concentration of "missing" values 
Vo and 49% respectively), so it is not surprising the that MAP estimator excludes them. 
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Figure 3: Illustrating the concentration of posterior mass of /3i and /?4 on the Pima Indian 
data for k6 {1,5,10,20} 

Figure [3] illustrates how mass concentrates on the MAP in two disparate cases for varying 
values of k. For fii (left panel), which is decidedly non-zero in the power posterior (s), the 
convergence to the MAP (apparently around /3 2 = 6) is modest. In the case of /3 4 (right 
panel) the convergence to the MAP (to zero) is more rapid as k is increased, allowing for 
confident variable de-selection in a way similar to the lasso for linear regression. 

Finally, we consider the case where v is also inferred by MCMC, jointly with the other 
parameters in the model. We use the IG prior on v with (r = 2,d = 0.1), a typical default 
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choice for linear regression (e.g., Gramacy and Pantaleo, 2010). Figure p] shows the marginal 



o 

C\J 

o 



(D o 



° r 

o J L 



lfc~ 



3= 



i 1 1 1 1 r~ 

10 20 30 40 50 60 



70 



nu 



Figure 4: Concentration of posterior mass of v on the Pima Indian data for k£ {1,5, 10, 20}. 
The histogram extends to v = 100 when k = 1, but the figure is trimmed. 

posterior for v under our settings of k. The rate of convergence is modest, with the spread 
of samples in the k = 20 case being only half that of the k = 1 case. 

4.2 Comparing c/pdf representations on binomial data 

To illustrate the efficient handling of binomial data and, simultaneously, to compare the cdf 
and pdf representations, consider the following simple binomial logistic regression problem. 
The true linear predictor is rji = 1 + xj (3 where (3 = (2, — 3, 2, — 4, 0, 0, 0, 0, 0) T , and the 
p = 9 dimensional x% are uniform in [0, l] p . The responses, yi G {0, ...,7ij}, are sampled 
with yi\xi ~ Bin(nj,//j) where rii = 20 and \ii = e Vi /(l + e Vi ). 



cdf 
pdf 



RMSE (sd) 
flat multi 



0.2117 (0.0602) 
0.2119 (0.0613) 



time (sd) 
flat multi 



0.2120 
0.2121 



(0.0606) 
(0.0602) 



cdf 570.4 (37.8) 
pdf 570.2 (28.7) 



64.6 
64.4 



(0.82) 
(0.99) 



Table 1: Comparing RMSEs (left) and timings in seconds (right) of c/pdf representations 
and flattened/multiplicity treatments of binomial regression modeling. 

Table [T] compares four different implementations of regularized binomial logistic regression 
(a = 1) based on the output of 100 repeated experiments with J^rii = 2000 (i.e., m = 100 
distinct cc, predictors). The metrics for comparison are root mean squared error (RMSE) 
between the true and posterior mean /3s, and overall computing time of the respective MCMC 
samplers. In all cases, we use T = 1000 MCMC rounds with MH sampling of Aj at thinning 
level(s) set by k' (i.e., via n[ for each Aj) as described in Section 3.1 The first 100 rounds 
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were discarded as burn-in. The left table shows that there is no significant difference between 
the cdf and pdf representations, or between the flattened or multiplicity handling of binomial 
data, in terms of RMSE. The right table portrays a more interesting story in terms of CPU 
times. The many fewer latent variables needed by the multiplicity implementation leads to a 
much (9x) faster execution compared to flattening, with no cost in accuracy (via RMSE). In 
contrast, there is no speed gain to using n fewer latent Zi variables in the pdf representation. 





Figure 5: Comparing MH (left) and slice (right) samplers for a Aj in the pdf representation. 



Figure [5] illuminates the differences in behavior between the MH and slice sampler for 
the Aj draws (in the pdf representation). A particularly "sticky" case, as chosen from output 
of the experiment, had k[ = 14. The top panel shows that many proposals from q a & can 
be rejected under the MH ratio, even when the chain is automatically thinned. The bottom 
panel shows the chain obtained for the same A, under the slice sampler, which never saves 
any rejected draws. However, this comes at the expense of many rejections in the inner-loop 
of the slice, resulting in a slow overall sampler. The median was four, but the mean was 81 
owing to a heavy right-hand tail in the distribution of rejections whose central 95% quantile 
spanned to 114 and maximum reached 140,600. The overall MCMC scheme based on the slice 
sampler took four times longer than the one based on MH. Despite the absence of rejections, 
the mixing in slice sampler chain (assessed visually) was no better than MH. Indeed, their 
effective sample size due to autocorrelation ( |Kass et al. 1998) was nearly identical: 223 for 
slice sampling, and 221 for MH. Therefore, MH is recommended for speed considerations. 



4.3 A simulated p ^> n experiment 

We turn now to a predictive comparison of the methods of this paper, both fully Bayesian and 
full/joint MAP (including is), benchmarked against other modern approaches to regularized 



logistic regression. Consider a synthetic data experiment like the one in Section 4.2 except: 
rii = 5 for each of 20 unique predictors x%, so that Yl n i = 100- Three variations on the data- 
generating (3 vectors were used. In the first case p = 9 and (3 = (2, —3, 0.74, —0.9, 0, 0, 0, 0) T ; 
in the second case p = 100, augmenting /3 from the first case with 91 more zeros; and in the 
third p = 1000 with 900 more zeros still. Each experiment involves a new random training 
design in the unit p-cube. Random testing set are created similarly, except that n! { = 100 
so Y2 n 'i = 10000. The metrics of comparison are (approximated) expected log likelihood 
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(ELL n and misclassification rates. 

Fully Bayesian posterior mean estimators (i.e., k — 1) are derived via priors/MCMC 
exactly as described in the preceding sections with (100,1000), (500,1500), (1000,2000) 
burn-in and total MCMC rounds in each of the cases p = 9, 100, 1000, respectively. MAP 
estimators are found by running a k = 10 chain initialized at (/3, A, z/)-values from the k — 1 
chain used for the mean estimators, except in the p = 1000 case where v was fixed to its 



posterior mean for reasons laid out in Section 3.2| Comparators include: the MLE obtained 



via the glm command in R; a binomial fit from the glmnet package (Friedman et al. , 2010); 



and the estimator of Krishnapuram et al. (2005) 4 ["krish" for short]. The MLE was unstable 



in the p = 100 & 1000 cases, so these results were omitted. CV was used to choose the 
penalty parameter in the p = 9 & 100 cases for glmnet, via cv. glmnet. The same procedure 
gave fatal errors in the p = 1000 case so we plugged in the estimate obtained from the 
corresponding p = 100 run in for this final case. Reliably setting the penalty parameter for 
"krish" , via CV or otherwise, was too computationally intensive for the p = 100, 1000 cases 
so we picked a setting by hand using out-of-sample simulations from the p = 9 case. 

The results of the Monte Carlo experiment are summarized in Figure [6] by boxplots, 
and numerically. The best estimators have high ELL, low miss rates, and lower variability 
across the 100 repetitions. The fully Bayesian and "krish" methods are the best when p = 9 
(left-hand region of the boxplots and the top region of the tables). The former wins by 
ELL, having fewer low values, and the latter wins on miss rate, having more small ones. The 
"krish" method wins by both metrics on average, since it employs a fortuitously hand-chosen 
setting of the penalty parameter. The MLE is good on average, but has some extreme ELL 
and miss rate values. The glmnet and MAP estimators are positioned in between. 

Distinctions in performance between the methods increases with p. See the right-hand 
regions of the boxplots and the bottom regions of the tables. The "krish" method suffers from 
high variability due to the fixed choice of the penalty parameter. The glmnet variability is 
much lower, but there are many extreme outliers. Behavior in both p = 100 and 1000 cases 
is qualitatively similar for this estimator even though the former used CV to set the penalty 
parameter and the latter used the same fixed value. The MAP and fully Bayesian estimators 
have similar average behavior compared to other estimators, but with lower variability. 
Apparently, choosing the penalty parameter via the posterior offers the most stability in 
high dimensional settings. The fully Bayesian approach appears preferable to the MAP in 
all cases, but this distinction is harder to make out as p increases. 

4.4 Spam data with interactions 

For a similar real-data experiment, consider the Spambase data set from UCI. It contains the 
binary classifications of 4601 emails based on 57 attributes which are treated as predictors. 
An interaction-expanded version of the predictor set contains approximately 1700 predictors. 



3 Specifically, the average of (1 — pi) log(l — pi) + pi logpi over all testing locations i, where pi and pi are 
the true and estimated predictive probabilities of the first label, respectively. 



This is equivalent to the Genkin et al. (20071 estimator but computationally less efficient. 
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Figure 6: Expected log likelihood (ELL) and misclassification rates in boxplot (left) and 
tabular (right) form. In both cases there are three sections, depending on the number of 
irrelevant predictors in the design matrix, wherein the same estimators are applied. The 
vertical dashed-red lines in the boxplots indicate the same demarkation as the horizontal 
lines in the tables. 

We performed a Monte Carlo experiment comprising of 20 random 5-fold CV training and 
testing sets using both the original and expanded predictors. Estimators were fit on the 
100 training sets, and validated by misclassification rate on the testing ones. The Bayes 
estimators used (500,1500) MCMC (burn-in, total) rounds with the original 57 attributes, 
and (1000,2000) with the expanded set. The MAP and glmnet calculations were exactly 
as described for the p = 100 case in Section 4^3 for the original predictor set, and like the 
p = 1000 case for the expanded one. And "krish" was like p = 9 and p = 100, respectively. 
The results of the experiment are summarized in Figure [7j The first thing we notice is 
that, in contrast with the results in Section |4~3| the performance improves as the predictor 



set expands since some of the interaction terms make good predictors. The MLE is unstable, 
and so the regularized estimators offer an improvement even when the number of predictors 
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Figure 7: Misclassification rates in boxplot (left) and tabular (right) form. In both cases 
there are two sections, depending absence or presence of interaction terms in the design 
matrix, wherein the same estimators are applied. The vertical dashed-red line in the boxplot 
indicates the same demarkation as the horizontal line in the table. 

is small relative to the number of instances. The Bayesian methods unilaterally outperform 
glmnet, and using the posterior to set the value of the regularization parameter is important 
in high dimensional settings. The "krish" estimator with fortuitous regularization is the 
best on the original predictor set, but worst on the expanded one where a revised setting of 
regularization could not be automated efficiently 

5 Discussion and extension 

We provide a simulation-based approach to regularized logistic regression that facilitates 
a variety of inferential goals under a single framework. Most of the development of the 
methodology, and all of the applications, involved the a = 1 case. Everything extends to the 
ridge prior (a = 2), i.e., an independent normal prior for each coefficient (3j with variance 
crJz/ 2 /K. Then, p K (u)j\j3, v) is a point mass at Uj = 1. Thus similar conjugacy results hold for 
the gamma prior on v and v 2 . 

From a computational perspective, our methods are competitive with the state-of-the 
art in un-regularized (and k — 1) contexts too. For example, we compared the efficiency of 
our methods to the "dRUM" MH sampler described by |Fruhwirth-Schnatter and Fruhwirth 



(2010). This method is attractive because it is fast and easy to implement. For example, 



on the Pima data it takes about 32s to generate 10,000 samples from the posterior which is 
about 7x faster than our pdf representation, which took 230s. However, the MH acceptance 
rate of the dRUM method was 46% which lead to an marginal ESS of 957 averaged over 
the nine (3j coefficients. Our pdf representation had an average ESS that was about 5x 
better, at 4518. So the methods work out to have similar overall efficiences in that example. 
But in higher dimension like the 57-d spam data, our Gibbs sampling approach is much 
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more attractive. The acceptance rate for dRUM was extremely low at 0.4%, which leads to 
ESSs that are essentially nil. Although our pdf representation is (again) 7x slower, faster 
convergence due to better movement in the chain leads to reasonable ESSs around 500. 

There are several extensions of our methodology that readily present themselves. For 
example, handling polychotomous data (i.e., > 2 classes) is straightforward. Following 
the setup in HH we may introduce C collections of coefficients /3^\ . . . ,/3^ for C classes 
with the convention that (3^ = so that logistic regression is recovered in the C = 2 
case. Then, we simply work with the conditional likelihoods L(f3^\y, (3^^) which turn 
out to have exactly the form of a logistic regression likelihood for the class indicator that 
each yi = j, independently for i — 1, . . . ,n. If there are n« > 1 trials for predictors x iy 
then our algorithm for binomial logistic regression is applicable via a vectorized multiplicity 



parameter as described in Section 3.3 Extending the methods to ordinal responses is even 
easier. Johnson and Albert ( 1999| Chapter 4) describe a Bayesian probit model which 
may be adapted for the logit case following either HH or our cdf representation. The pdf 
representation may not be readily applicable because the latent Z{ are useful for efficient 
sampling of the ordinal break points. 

An further direction is to other classes of regularization priors. Implementing the Normal- 



Gamma extension (Griffin and Brown, 2010) requires adding an extra (conjugate) parameter. 
A promising new approach is the horseshoe prior (Carvalho et al. , 2010), which can be 
implemented with the addition of a slice sampler. Often variable selection is a primary 
goal of regularization, for which our methods would require further extension. For example, 
HH describe an approach to variable selection for logistic regression via Reversible Jump 



MCMC (Green, 1995) which is adaptable to our framework. A similar regularized approach 
in a linear regression is provided b}Gramacy and Pantaleo (2010). For variable selection for 



logistic regression using spike-and-slab priors, see Tiichler (2008). 



Acknowledgments 

This research was partially funded by EPSRC grant EP/D065704/1 to RBG. The authors 
would like thank Matt Taddy for interesting discussions on the efficient handling of Binomial 
data, extensions to Multinomial regression, and EM code for the MAP estimator (s). We are 
grateful to two referees and an associate editor for valuable comments. 

A Posterior conditional for /3 in the pdf representation 



For a particular A, i.e., ignoring the integral in Eq. (13), we have the following expression 
for the likelihood in vector/matrix form. 



IK 1 * 



-VixJ P 



1=1 



<•'" x 'rx V {-U(y.X)P+±(a-b)\\ K- 1 ({y.X)^ l -{a-b)\ 



An expression for the posterior conditional for can then obtained by multiplying by the 
kernel of the MVN prior given uj, provided below Eq. (15), namely: exp{ — |/3 T (^2-S _1 fi~ 1 /3)}. 
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Combining the terms in the three exponents gives the following quadratic form. 

-2ay T XP + ((y.X)p + i(o - 6)a) A" 1 ((y.X)P + 1 -(a- b)\\ + /3 T f^E" 1 ^ 1 ) 

Collecting terms for /3 yields 

/3 T ( {y.XyX-\y.X) + ^E^Q- 1 ) /3 - (2c«/ T X - (a - 6)( 2/ .X)A" 1 A)/3. 



Therefore we deduce that the conditional is Af p ((3,V) where V 1 = (y.X) T A l (y.X) + 
^S- 1 ^- 1 . Recognizing that {y.X)A- l X = X T y gives that J3 = V{a - \[a - b])X T y. 



B Generalized Inverse Gaussian distribution 

The pdf of a Generalized Inverse Gaussian, GIG(A, x, i>) is 

(ip/x) X/2 x-i f 1 

9(x;X,x,ip) = „ , AT ^ exp<^ --{rpx + x/x) 

where Xa is a modified Bessel function of the second kind. If X ~ GIG(|,%, ?/>) then 
X -1 ~ IN(/i = \/ip/Xi ^ = '0) where where IN is the inverse Gaussian distribution with pdf 



,. A j A(x-/i) 2 

/(x;/i ' A) = V^ exp \ — v^ 

The mean and variance are E{x} = \x and Var[x] = /i 3 /A. A generalized inverse Gaus- 
sian GIG \\-,Xity) is an inverse of an Inverse Gaussian. For simulation from GIG and IN 



distributions see Devroye (1986). 
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